{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Otto商品分类——RBF 核SVM\n",
    "\n",
    "我们以Kaggle 2015年举办的Otto Group Product Classification Challenge竞赛数据为例，分别调用\n",
    "缺省参数SVC、\n",
    "SVC + GridSearchCV进行参数调优。\n",
    "\n",
    "Otto数据集是著名电商Otto提供的一个多类商品分类问题，类别数=9. 每个样本有93维数值型特征（整数，表示某种事件发生的次数，已经进行过脱敏处理）。 竞赛官网：https://www.kaggle.com/c/otto-group-product-classification-challenge/data\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 首先 import 必要的模块\n",
    "import pandas as pd \n",
    "import numpy as np\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "#竞赛的评价指标为logloss\n",
    "#from sklearn.metrics import log_loss  \n",
    "\n",
    "#SVM虽然也支持输出各类的概率，但这需要额外的计算费用，且得到的概率也不保证是合法的概率，\n",
    "#所以在这个例子中我们用正确率accuracy_score作为模型选择的度量，最后在最佳超参数情况下再训练模型，得到概率表示\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "from matplotlib import pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 读取数据 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pregnants</th>\n",
       "      <th>Plasma_glucose_concentration</th>\n",
       "      <th>blood_pressure</th>\n",
       "      <th>Triceps_skin_fold_thickness</th>\n",
       "      <th>serum_insulin</th>\n",
       "      <th>BMI</th>\n",
       "      <th>Diabetes_pedigree_function</th>\n",
       "      <th>Age</th>\n",
       "      <th>Target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.639947</td>\n",
       "      <td>0.866045</td>\n",
       "      <td>-0.031990</td>\n",
       "      <td>0.670643</td>\n",
       "      <td>-0.181541</td>\n",
       "      <td>0.166619</td>\n",
       "      <td>0.468492</td>\n",
       "      <td>1.425995</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.844885</td>\n",
       "      <td>-1.205066</td>\n",
       "      <td>-0.528319</td>\n",
       "      <td>-0.012301</td>\n",
       "      <td>-0.181541</td>\n",
       "      <td>-0.852200</td>\n",
       "      <td>-0.365061</td>\n",
       "      <td>-0.190672</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.233880</td>\n",
       "      <td>2.016662</td>\n",
       "      <td>-0.693761</td>\n",
       "      <td>-0.012301</td>\n",
       "      <td>-0.181541</td>\n",
       "      <td>-1.332500</td>\n",
       "      <td>0.604397</td>\n",
       "      <td>-0.105584</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.844885</td>\n",
       "      <td>-1.073567</td>\n",
       "      <td>-0.528319</td>\n",
       "      <td>-0.695245</td>\n",
       "      <td>-0.540642</td>\n",
       "      <td>-0.633881</td>\n",
       "      <td>-0.920763</td>\n",
       "      <td>-1.041549</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-1.141852</td>\n",
       "      <td>0.504422</td>\n",
       "      <td>-2.679076</td>\n",
       "      <td>0.670643</td>\n",
       "      <td>0.316566</td>\n",
       "      <td>1.549303</td>\n",
       "      <td>5.484909</td>\n",
       "      <td>-0.020496</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pregnants  Plasma_glucose_concentration  blood_pressure  \\\n",
       "0   0.639947                      0.866045       -0.031990   \n",
       "1  -0.844885                     -1.205066       -0.528319   \n",
       "2   1.233880                      2.016662       -0.693761   \n",
       "3  -0.844885                     -1.073567       -0.528319   \n",
       "4  -1.141852                      0.504422       -2.679076   \n",
       "\n",
       "   Triceps_skin_fold_thickness  serum_insulin       BMI  \\\n",
       "0                     0.670643      -0.181541  0.166619   \n",
       "1                    -0.012301      -0.181541 -0.852200   \n",
       "2                    -0.012301      -0.181541 -1.332500   \n",
       "3                    -0.695245      -0.540642 -0.633881   \n",
       "4                     0.670643       0.316566  1.549303   \n",
       "\n",
       "   Diabetes_pedigree_function       Age  Target  \n",
       "0                    0.468492  1.425995       1  \n",
       "1                   -0.365061 -0.190672       0  \n",
       "2                    0.604397 -0.105584       1  \n",
       "3                   -0.920763 -1.041549       0  \n",
       "4                    5.484909 -0.020496       1  "
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 读取数据\n",
    "#由于用tf_idf数据运行特别慢，所以这里用课程数据代替来做\n",
    "train = pd.read_csv(\"FE_pima-indians-diabetes.csv\")\n",
    "train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "#train.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 准备数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将类别字符串变成数字\n",
    "# drop ids and get labels\n",
    "y_train = train['Target']   #形式为Class_x\n",
    "X_train = train.drop([\"Target\"], axis=1)\n",
    "\n",
    "#保存特征名字以备后用（可视化）\n",
    "feat_names = X_train.columns \n",
    "\n",
    "#sklearn的学习器大多之一稀疏数据输入，模型训练会快很多\n",
    "from scipy.sparse import csr_matrix\n",
    "X_train = csr_matrix(X_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 训练样本6w+，交叉验证太慢，用train_test_split估计模型性能\n",
    "# SVM对大样本数据集支持不太好\n",
    "from sklearn.model_selection import train_test_split\n",
    "X_train_part, X_val, y_train_part, y_val = train_test_split(X_train, y_train, train_size = 0.8,random_state = 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(614, 8)\n"
     ]
    }
   ],
   "source": [
    "print (X_train_part.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 模型训练"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### RBF核SVM正则参数调优\n",
    "\n",
    "RBF核是SVM最常用的核函数。\n",
    "RBF核SVM 的需要调整正则超参数包括C（正则系数，一般在log域（取log后的值）均匀设置候选参数）和核函数的宽度gamma\n",
    "C越小，决策边界越平滑； \n",
    "gamma越小，决策边界越平滑。\n",
    "\n",
    "采用交叉验证，网格搜索步骤与Logistic回归正则参数处理类似，在此略。\n",
    "\n",
    "这里我们用校验集（X_val、y_val）来估计模型性能"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.svm import SVC"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit_grid_point_RBF(C, gamma, X_train, y_train, X_val, y_val):\n",
    "    \n",
    "    # 在训练集是那个利用SVC训练\n",
    "    SVC3 =  SVC( C = C, kernel='rbf', gamma = gamma)\n",
    "    SVC3 = SVC3.fit(X_train, y_train)\n",
    "    \n",
    "    # 在校验集上返回accuracy\n",
    "    accuracy = SVC3.score(X_val, y_val)\n",
    "    \n",
    "    print(\"C= {} and gamma = {}: accuracy= {} \" .format(C, gamma, accuracy))\n",
    "    return accuracy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "accuracy_s = np.matrix(np.zeros(shape=(5, 3)), float)\n",
    "gamma_s = np.logspace(-1, 1, 3)  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "C= 0.1 and gamma = 0.1: accuracy= 0.7922077922077922 \n",
      "C= 0.1 and gamma = 1.0: accuracy= 0.6948051948051948 \n",
      "C= 0.1 and gamma = 10.0: accuracy= 0.6948051948051948 \n"
     ]
    }
   ],
   "source": [
    "oneC = 0.1\n",
    "\n",
    "for j, gamma in enumerate(gamma_s):\n",
    "    accuracy_s[0,j] = fit_grid_point_RBF(oneC, gamma, X_train_part, y_train_part, X_val, y_val)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "C= 1 and gamma = 0.1: accuracy= 0.7792207792207793 \n",
      "C= 1 and gamma = 1.0: accuracy= 0.7467532467532467 \n",
      "C= 1 and gamma = 10.0: accuracy= 0.6948051948051948 \n"
     ]
    }
   ],
   "source": [
    "oneC = 1\n",
    "\n",
    "for j, gamma in enumerate(gamma_s):\n",
    "    accuracy_s[1,j] = fit_grid_point_RBF(oneC, gamma, X_train_part, y_train_part, X_val, y_val)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "C= 10 and gamma = 0.1: accuracy= 0.7857142857142857 \n",
      "C= 10 and gamma = 1.0: accuracy= 0.7272727272727273 \n",
      "C= 10 and gamma = 10.0: accuracy= 0.6948051948051948 \n"
     ]
    }
   ],
   "source": [
    "oneC = 10\n",
    "\n",
    "for j, gamma in enumerate(gamma_s):\n",
    "    accuracy_s[2,j] = fit_grid_point_RBF(oneC, gamma, X_train_part, y_train_part, X_val, y_val)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "C= 100 and gamma = 0.1: accuracy= 0.7662337662337663 \n",
      "C= 100 and gamma = 1.0: accuracy= 0.7337662337662337 \n",
      "C= 100 and gamma = 10.0: accuracy= 0.6948051948051948 \n"
     ]
    }
   ],
   "source": [
    "oneC = 100\n",
    "\n",
    "for j, gamma in enumerate(gamma_s):\n",
    "    accuracy_s[3,j] = fit_grid_point_RBF(oneC, gamma, X_train_part, y_train_part, X_val, y_val)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "C= 1000 and gamma = 0.1: accuracy= 0.7337662337662337 \n",
      "C= 1000 and gamma = 1.0: accuracy= 0.7337662337662337 \n",
      "C= 1000 and gamma = 10.0: accuracy= 0.6948051948051948 \n"
     ]
    }
   ],
   "source": [
    "oneC = 1000\n",
    "\n",
    "for j, gamma in enumerate(gamma_s):\n",
    "    accuracy_s[4,j] = fit_grid_point_RBF(oneC, gamma, X_train_part, y_train_part, X_val, y_val)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "C= 0.1 and gamma = 0.1: accuracy= 0.7922077922077922 \n",
      "C= 0.1 and gamma = 1.0: accuracy= 0.6948051948051948 \n",
      "C= 0.1 and gamma = 10.0: accuracy= 0.6948051948051948 \n",
      "C= 1.0 and gamma = 0.1: accuracy= 0.7792207792207793 \n",
      "C= 1.0 and gamma = 1.0: accuracy= 0.7467532467532467 \n",
      "C= 1.0 and gamma = 10.0: accuracy= 0.6948051948051948 \n",
      "C= 10.0 and gamma = 0.1: accuracy= 0.7857142857142857 \n",
      "C= 10.0 and gamma = 1.0: accuracy= 0.7272727272727273 \n",
      "C= 10.0 and gamma = 10.0: accuracy= 0.6948051948051948 \n",
      "C= 100.0 and gamma = 0.1: accuracy= 0.7662337662337663 \n",
      "C= 100.0 and gamma = 1.0: accuracy= 0.7337662337662337 \n",
      "C= 100.0 and gamma = 10.0: accuracy= 0.6948051948051948 \n",
      "C= 1000.0 and gamma = 0.1: accuracy= 0.7337662337662337 \n",
      "C= 1000.0 and gamma = 1.0: accuracy= 0.7337662337662337 \n",
      "C= 1000.0 and gamma = 10.0: accuracy= 0.6948051948051948 \n"
     ]
    }
   ],
   "source": [
    "#需要调优的参数\n",
    "C_s = np.logspace(-1, 3, 5)# logspace(a,b,N)把10的a次方到10的b次方区间分成N份 \n",
    "gamma_s = np.logspace(-1, 1, 3)    \n",
    "\n",
    "accuracy_s = np.matrix(np.zeros(shape=(5, 3)), float)\n",
    "for i, oneC in enumerate(C_s):\n",
    "    for j, gamma in enumerate(gamma_s):\n",
    "         #print(i,j)\n",
    "        tmp = fit_grid_point_RBF(oneC, gamma, X_train_part, y_train_part, X_val, y_val)\n",
    "        accuracy_s[i,j]=tmp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从上述结果会发现，gamma参数非常重要(当gamma=0.01或gamma=100时性能很差),非线性模型比线性模型性能好（注意我们这里只用了tfidf特征）。\n",
    "但速度慢了不是一点半点(sklearn建议核方法SVM样本数不超过10000)\n",
    "可以考虑将训练样本分为多个子集，每个子集训练一个RBF核SVM模型，最后多个模型融合的结果的到最终模型（训练速度加快，但测试可能更慢）\n",
    "\n",
    "### C= 0.1 and gamma = 0.1: accuracy= 0.7922077922077922 \n",
    "C= 0.1 and gamma = 1.0: accuracy= 0.6948051948051948 \n",
    "\n",
    "C= 0.1 and gamma = 10.0: accuracy= 0.6948051948051948 \n",
    "\n",
    "C= 1.0 and gamma = 0.1: accuracy= 0.7792207792207793 \n",
    "\n",
    "C= 1.0 and gamma = 1.0: accuracy= 0.7467532467532467 \n",
    "\n",
    "C= 1.0 and gamma = 10.0: accuracy= 0.6948051948051948 \n",
    "\n",
    "C= 10.0 and gamma = 0.1: accuracy= 0.7857142857142857 \n",
    "\n",
    "C= 10.0 and gamma = 1.0: accuracy= 0.7272727272727273\n",
    "\n",
    "C= 10.0 and gamma = 10.0: accuracy= 0.6948051948051948\n",
    "\n",
    "C= 100.0 and gamma = 0.1: accuracy= 0.7662337662337663 \n",
    "\n",
    "C= 100.0 and gamma = 1.0: accuracy= 0.7337662337662337\n",
    "\n",
    "C= 100.0 and gamma = 10.0: accuracy= 0.6948051948051948 \n",
    "\n",
    "C= 1000.0 and gamma = 0.1: accuracy= 0.7337662337662337 \n",
    "\n",
    "C= 1000.0 and gamma = 1.0: accuracy= 0.7337662337662337 \n",
    "\n",
    "C= 1000.0 and gamma = 10.0: accuracy= 0.6948051948051948 \n",
    "### 由于数据选择的不同，这里的最优超参数C和gamma的取值为0.1和0.1，返回的模型评估正确率最高为 0.7922077922077922 ，由于用tf_idf的数据运行结果太慢，没运行出结果，所以采用的课程作业的数据来进行的rbf核的超参数调优的。\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 找到最佳参数后，用全体训练数据训练模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# SVC训练SVC，支持概率输出\n",
    "Best_C = 100\n",
    "Best_gamma = 1.0\n",
    "\n",
    "SVC4 =  SVC( C = Best_C, kernel='rbf', gamma = Best_gamma, probability=True)\n",
    "SVC4.fit(X_train, y_train)\n",
    "\n",
    "#保持模型，用于后续测试\n",
    "import cPickle\n",
    "cPickle.dump(SVC4, open(\"Otto_RBF_SVC.pkl\", 'wb'))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
